Objective: Predicting Completed Appointments Among All Inpatients

The objective of the model was to predict the probability of an inpatient not completing their appointment. The training data for the model was developed from the MUSC inpatient data pipeline; details are in the repository README. A combination of semi-structured text and numeric data, updated each night and including diagnoses, problem lists, and therapeutic classes, was used to create the independent variables.

Modeling Approach

The text data was run through a tokenizer to build a bag-of-words representation, with stop words removed. The modeling technique was a gradient boosted tree via the xgboost package. The data was split by putting 75% of the HARs into the training set and 25% into the test set. The models were built using two objectives: maximizing AUC and minimizing misclassification log loss. A grid search over L1 and L2 regularization was used to tune the model.

Variables Considered

The following variables were used to build the model. The text columns are semi-structured and were put into a bag-of-words model. Categorical columns were one-hot encoded. All numeric variables were zero-imputed when missing.

Categorical columns (cat_cols): AppointmentTypeCategoryDescription, AppointmentTypeDescription, PatientMarketingRegion, PatientSex, PatientStateCode, PayorFinancialClass, SchedulingProviderReportingSpecialty, ClinicalDepartment, DepartmentSpecialty, DepartmentICCE

Zero-imputed numeric columns (zero_imputer_cols): AppointmentHourOfDay, AppointmentMonth, ScheduledLagDays, ScheduledMinutes, CopayDue, PatientAge, ReferralFlag, CompletedAppointmentPercentageLast12, PatientCancelPercentageLast12, ProviderCancelPercentageLast12, FromSCFlag, AppointmentDayOfWeekNumber, AppointmentDayOfMonth, ACOPatientPopulationFlag, AppointmentsLast12, CompletedAppointmentsLast12, NoShowAppointmentsLast12, PatientCancelledWithin24HoursAppointmentsLast12, CompletedEstablishedPatientAppointmentsLast12, CompletedNewPatientAppointmentsLast12, CompletedNurseOrAncillaryAppointmentsLast12, CompletedOtherAppointmentsLast12, CompletedProcedureAppointmentsLast12, AllCancelledAppointmentsLast12, AllCancelledWithin24HoursAppointmentsLast12, PatientCancelledAppointmentsLast12, ProviderCancelledAppointmentsLast12, ProviderCancelledWithin30DaysAppointmentsLast12, OtherCancelledAppointmentsLast12, ScheduledMinutesLast12, CompletedMinutesLast12, CheckInToApptLast12, CheckInToRoomLast12, ApptToRoomLast12, RoomToNurseLeaveLast12, RoomToProvEnterLast12, NurseLeaveToProvEnterLast12, ProvEnterToVisitEndLast12, RoomToVisitEndLast12, ApptToVisitEndLast12, VisitEndToCheckOutLast12, CheckInToCheckOutLast12, AvgScheduledMinutesLast12, AvgCompletedMinutesLast12, AvgCheckInToApptLast12, AvgCheckInToRoomLast12, AvgApptToRoomLast12, AvgRoomToNurseLeaveLast12, AvgRoomToProvEnterLast12, AvgNurseLeaveToProvEnterLast12, AvgProvEnterToVisitEndLast12, AvgRoomToVisitEndLast12, AvgApptToVisitEndLast12, AvgVisitEndToCheckOutLast12, AvgCheckInToCheckOutLast12, ArrivalTimelinessLast12, AverageLagDaysLast12, CopayDueLast12, CopayCollectedLast12, DistinctProvidersLast12, DistinctProviderSpecialtiesLast12, DistinctDepartmentsLast12, DistinctDepartmentSpecialtiesLast12, InpatientDischargesLast12, AverageLOSLast12, AverageCMILast12, InpatientChargesLast12, InpatientPaymentsLast12, InpatientAdjustmentsLast12, EDVisitsLast12, TotalMinInEDLast12, AvgMinInEDLast12, EDObservationVisitsLast12, EDLeftWithoutSeenVisitsLast12, EDBehavioralHealthVisitsLast12, EDInpatientAdmissionsLast12, EDReadmitBounceBackIn72HoursLast12, PBDepartmentsLast12, PBDepartmentSpecialtiesLast12, PBClinicalDepartmentsLast12, PBPerformingProvidersLast12, PBPerformingProviderSpecialtiesLast12, PBServiceCodesLast12, ASAUnitsTotalLast12, PBServiceCountLast12, PBServiceUnitsLast12, PBChargeAmountLast12, PBWorkRVUsLast12, PBAdjustmentsLast12, PBBadDebtWriteOffLast12, PBChartiyWriteOffLast12, PBContractualWriteOffLast12, PBDiscountsLast12, PBPaymentsLast12

Number of features: 100
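The preprocessing described above (one-hot encoding for categorical columns, zero imputation for missing numerics) can be sketched with scikit-learn. The two columns and their values here are a small synthetic illustration, not the real data.

```python
# Sketch of the preprocessing described above: one-hot encode
# categorical columns, impute missing numerics with zero.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "PatientSex": ["F", "M", "F", "M"],       # a cat_cols column
    "PatientAge": [34.0, np.nan, 71.0, 55.0],  # a zero_imputer_cols column
})

preprocess = ColumnTransformer([
    # Categories unseen at training time become all-zero indicators.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["PatientSex"]),
    # Missing numeric values are replaced with zero, as stated above.
    ("num", SimpleImputer(strategy="constant", fill_value=0), ["PatientAge"]),
])
X = preprocess.fit_transform(df)
```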

Holdout Set

Training rows: 4,095,643
Testing rows: 1,755,276

AUC of the ROC Curve

In a ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1 − specificity) for different cutoff values. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.
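A small sketch of how the curve and its AUC are computed; the labels and scores here are illustrative, not model output.

```python
# Each threshold yields one (fpr, tpr) point; AUC integrates the curve.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# fpr = 1 - specificity, tpr = sensitivity, one point per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
```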

Sensitivity and Specificity Curves

Sensitivity (true positive rate) is the proportion of cases where the model predicts positive given that the target variable is present. Specificity (true negative rate) is the proportion of cases where the model predicts negative given that the target variable is absent.

Predicted Probability Distribution

The distribution of model predictions is shown below. The red line represents the proportion of cases where the target variable is present.
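The quantities behind this plot can be sketched as a histogram of predicted probabilities plus the overall positive rate (the red line). The scores below are synthetic, not model output.

```python
# Sketch of the predicted-probability distribution and the
# overall positive rate it is compared against (synthetic data).
import numpy as np

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(0.3 * y_true + rng.normal(0.35, 0.15, size=1000), 0, 1)

counts, bin_edges = np.histogram(y_prob, bins=20, range=(0, 1))
positive_rate = y_true.mean()  # the "red line" in the plot
```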

Accuracy vs. Cutoff Plot

This plot displays the trade-off between cutoff value and accuracy. The optimal cutoff here is found by maximizing the model's accuracy. This is not ideal for problems with class imbalance; however, optimal cutoffs can be chosen when intervention costs are known.
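The accuracy-maximizing cutoff can be found by scanning candidate thresholds, as sketched below on illustrative labels and probabilities.

```python
# Sketch of the accuracy-vs-cutoff scan described above
# (illustrative data, not model output).
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.4, 0.1, 0.9, 0.6, 0.7, 0.3, 0.2])

cutoffs = np.linspace(0.05, 0.95, 19)
# Accuracy at each cutoff: fraction of thresholded predictions
# that match the true labels.
accuracies = [np.mean((y_prob >= c).astype(int) == y_true) for c in cutoffs]
best_cutoff = cutoffs[int(np.argmax(accuracies))]
```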

Model Metrics Using the Above Cutoff

Sensitivity (true positive rate) is the proportion of cases where the model predicts positive when the target variable is present. Specificity (true negative rate) is the proportion of cases where the model predicts negative when the target variable is absent. Positive predictive value is the proportion of cases where the target variable is present when the model predicts positive. Negative predictive value is the proportion of cases where the target variable is absent when the model predicts negative.
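All four metrics follow directly from the confusion matrix at a fixed cutoff; a minimal sketch with illustrative counts:

```python
# Sensitivity, specificity, PPV, and NPV from a confusion matrix
# at a fixed cutoff (illustrative labels and predictions).
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
ppv = tp / (tp + fp)          # positive predictive value
npv = tn / (tn + fn)          # negative predictive value
```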

Classification Report

Using cutoff: 0.49
             precision    recall  f1-score   support

          0       0.87      0.98      0.92   1462373
          1       0.69      0.26      0.38    292903

avg / total       0.84      0.86      0.83   1755276

Variable Importance

The variable importance plots display the most important features by their information gain, a measurement that sums up how much 'information' a feature gives about the target variable. Information gain measures the reduction in entropy, or uncertainty, summed over each split made on the given feature.

SHAP Univariate Plots

SHapley Additive exPlanations (SHAP) is an approach to explaining the output of machine learning models. SHAP assigns a value to each feature for each prediction (i.e., feature attribution); the higher the value, the larger the feature's attribution to that specific prediction. In classification, a positive SHAP value indicates that a factor increases the value of the model's prediction (risk), whereas a negative SHAP value indicates that a factor decreases it. The sum of SHAP values over all features will approximately equal the model prediction for each observation. In the following plots, blue points signify negative SHAP values, red points have positive SHAP values, and yellow points are values for which the feature contributes little to the predicted value.

SHAP Bivariate Plots

SHAP Summary Plot Over Categorical/Text Columns

The following plot displays the effect of the top text/category values on the model's predictions. Red data points represent cases where the text/category feature is observed in a patient's records, whereas blue points represent cases where the feature is not observed.